[HBASE-25357] allow specifying binary row key range to pre-split regions #72

Dieken · 2020-11-01T09:27:39Z

For example, when the row key starts with a long integer, we can specify
a binary key range to pre-split regions:

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.hbase.util.Bytes;

df.write()
  .format("org.apache.hadoop.hbase.spark")
  .option(HBaseTableCatalog.tableCatalog(), catalog)
  .option(HBaseTableCatalog.newTable(), 5)
  .option(HBaseTableCatalog.regionStart(), new String(Bytes.toBytes(0L), StandardCharsets.ISO_8859_1))
  .option(HBaseTableCatalog.regionEnd(), new String(Bytes.toBytes(2000000L), StandardCharsets.ISO_8859_1))
  .mode(SaveMode.Append)
  .save();
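To make the option values above concrete, here is a minimal, standalone sketch of what the two strings carry. It assumes `Bytes.toBytes(long)` yields the 8-byte big-endian form of the value and uses `java.nio.ByteBuffer` as a stand-in so that it runs without HBase on the classpath. The class and method names (`BinaryKeyOption`, `longToBytes`, `toOptionValue`, `fromOptionValue`) are hypothetical and not part of the patch.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BinaryKeyOption {
    // Stand-in for Bytes.toBytes(long): 8 bytes, big-endian (ByteBuffer's default order).
    static byte[] longToBytes(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Wrap raw bytes in a String, one char per byte, so they fit a string-valued option.
    static String toOptionValue(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    // What the receiving side would do to recover the original bytes.
    static byte[] fromOptionValue(String option) {
        return option.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String end = toOptionValue(longToBytes(2000000L));
        long decoded = ByteBuffer.wrap(fromOptionValue(end)).getLong();
        System.out.println(end.length() + " chars, decodes to " + decoded);
    }
}
```

The string is not meant to be human-readable; it is only a carrier whose chars map one-to-one onto the key bytes.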

Apache-HBase · 2020-11-01T09:36:57Z

💔 -1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	1m 4s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
-0 ⚠️	test4tests	0m 0s	The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
		_ master Compile Tests _
+1 💚	mvninstall	1m 23s	master passed
+1 💚	compile	0m 37s	master passed
+1 💚	scaladoc	0m 17s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	0m 44s	the patch passed
+1 💚	compile	0m 35s	the patch passed
+1 💚	scalac	0m 35s	the patch passed
-1 ❌	whitespace	0m 0s	The patch has 2 line(s) that end in whitespace. Use git apply --whitespace=fix <<patch_file>>. Refer https://git-scm.com/docs/git-apply
+1 💚	scaladoc	0m 16s	the patch passed
		_ Other Tests _
+1 💚	unit	3m 55s	hbase-spark in the patch passed.
		9m 0s

Subsystem	Report/Notes
Docker	ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/artifact/yetus-precommit-check/output/Dockerfile
GITHUB PR	#72
Optional Tests	dupname scalac scaladoc unit compile
uname	Linux 641d10548c92 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 GNU/Linux
Build tool	hb_maven
Personality	dev-support/jenkins/hbase-personality.sh
git revision	master / `b9706c8`
whitespace	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/artifact/yetus-precommit-check/output/whitespace-eol.txt
Test Results	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/testReport/
Max. process+thread count	915 (vs. ulimit of 12500)
modules	C: spark/hbase-spark U: spark/hbase-spark
Console output	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/console
versions	git=2.20.1
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2020-11-03T04:41:01Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	4m 59s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 1s	No case conflicting files found.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
-0 ⚠️	test4tests	0m 0s	The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
		_ master Compile Tests _
+1 💚	mvninstall	12m 25s	master passed
+1 💚	compile	2m 18s	master passed
+1 💚	scaladoc	0m 23s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	2m 6s	the patch passed
+1 💚	compile	1m 49s	the patch passed
+1 💚	scalac	1m 49s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	scaladoc	0m 21s	the patch passed
		_ Other Tests _
+1 💚	unit	24m 8s	hbase-spark in the patch passed.
		48m 44s

Subsystem	Report/Notes
Docker	ClientAPI=1.40 ServerAPI=1.40 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/2/artifact/yetus-precommit-check/output/Dockerfile
GITHUB PR	#72
Optional Tests	dupname scalac scaladoc unit compile
uname	Linux 5951f3910863 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 GNU/Linux
Build tool	hb_maven
Personality	dev-support/jenkins/hbase-personality.sh
git revision	master / `b9706c8`
Test Results	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/2/testReport/
Max. process+thread count	826 (vs. ulimit of 12500)
modules	C: spark/hbase-spark U: spark/hbase-spark
Console output	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/2/console
versions	git=2.20.1
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

meszibalu · 2020-11-30T14:34:56Z

@Dieken please create a Jira for this change if you want to get it merged. Thank you!


Dieken · 2020-12-04T04:18:08Z

> @Dieken please create a Jira for this change if you want to get it merged. Thank you!

Created https://issues.apache.org/jira/browse/HBASE-25357

@meszibalu

wchevreuil · 2020-12-04T15:43:34Z

spark/hbase-spark/src/main/scala/org/apache/hadoop/hbase/spark/DefaultSource.scala

-      parameters.get(HBaseTableCatalog.regionEnd)
-        .getOrElse(HBaseTableCatalog.defaultRegionEnd))
+    val startKey = parameters.get(HBaseTableCatalog.regionStart)
+      .getOrElse(HBaseTableCatalog.defaultRegionStart).getBytes(StandardCharsets.ISO_8859_1)


I'm not sure it is a good idea to use a different encoding from the default used by the Bytes util converter (StandardCharsets.UTF_8). Many pieces of HBase code rely on the Bytes converter, so comparisons may become inconsistent.

Also, why are you using a different converter here? Can you elaborate on the issue you are having with the built-in Bytes converter?

Dieken · 2020-12-08

Spark options are passed as strings; there is no way to pass bytes directly. I need to pass a binary row key, so I have to interpret the binary bytes as an ISO_8859_1-encoded String, because they are not valid UTF-8.

It's a trick, and it does break backward compatibility for UTF-8 strings containing characters beyond the ISO_8859_1 charset; such a UTF-8 string must be wrapped as explained in the JIRA issue.

I can't figure out a better way to pass bytes in a Spark option.
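The encoding trade-off discussed in this exchange can be checked with a small standalone sketch. It shows that arbitrary bytes survive a String round trip under ISO_8859_1 but not under UTF_8, and that a text key containing characters outside ISO_8859_1 is altered unless it is wrapped first. The `wrapUtf8Key` helper is only one plausible form of the wrapping and is an assumption here; the JIRA issue remains the authoritative description. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class CharsetRoundTrip {
    // True if bytes -> String -> bytes returns the original array under the given charset.
    static boolean roundTrips(byte[] raw, Charset charset) {
        byte[] back = new String(raw, charset).getBytes(charset);
        return Arrays.equals(raw, back);
    }

    // Assumed wrapping for a text key: carry its UTF-8 bytes one char per byte,
    // so that a later getBytes(ISO_8859_1) recovers exactly those UTF-8 bytes.
    static String wrapUtf8Key(String textKey) {
        return new String(textKey.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] binaryKey = {(byte) 0x00, (byte) 0x80, (byte) 0xFF};
        System.out.println("ISO_8859_1 round-trips: "
            + roundTrips(binaryKey, StandardCharsets.ISO_8859_1));
        System.out.println("UTF_8 round-trips: "
            + roundTrips(binaryKey, StandardCharsets.UTF_8));

        String textKey = "\u20AC100"; // starts with a character outside ISO_8859_1
        byte[] utf8 = textKey.getBytes(StandardCharsets.UTF_8);
        byte[] unwrapped = textKey.getBytes(StandardCharsets.ISO_8859_1); // lossy
        byte[] wrapped = wrapUtf8Key(textKey).getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("unwrapped matches UTF-8 bytes: " + Arrays.equals(unwrapped, utf8));
        System.out.println("wrapped matches UTF-8 bytes: " + Arrays.equals(wrapped, utf8));
    }
}
```

Keys that are pure ASCII encode to the same bytes under both charsets, so only keys with characters above U+007F are affected by the change.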

Apache-HBase · 2021-08-13T11:41:13Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	1m 1s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
-0 ⚠️	test4tests	0m 0s	The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
		_ master Compile Tests _
+1 💚	mvninstall	1m 27s	master passed
+1 💚	compile	0m 37s	master passed
+1 💚	scaladoc	0m 46s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	0m 45s	the patch passed
+1 💚	compile	0m 39s	the patch passed
+1 💚	scalac	0m 39s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	scaladoc	0m 46s	the patch passed
		_ Other Tests _
+1 💚	unit	7m 3s	hbase-spark in the patch passed.
		13m 48s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/artifact/yetus-precommit-check/output/Dockerfile
GITHUB PR	#72
Optional Tests	dupname scalac scaladoc unit compile
uname	Linux b9487a03e2cc 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 GNU/Linux
Build tool	hb_maven
Personality	dev-support/jenkins/hbase-personality.sh
git revision	master / `fddb433`
Test Results	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/testReport/
Max. process+thread count	918 (vs. ulimit of 12500)
modules	C: spark/hbase-spark U: spark/hbase-spark
Console output	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/console
versions	git=2.20.1
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2021-08-18T11:58:36Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	1m 43s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
-0 ⚠️	test4tests	0m 0s	The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
		_ master Compile Tests _
+1 💚	mvninstall	1m 55s	master passed
+1 💚	compile	0m 49s	master passed
+1 💚	scaladoc	0m 54s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	0m 55s	the patch passed
+1 💚	compile	0m 48s	the patch passed
+1 💚	scalac	0m 48s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	scaladoc	0m 57s	the patch passed
		_ Other Tests _
+1 💚	unit	7m 19s	hbase-spark in the patch passed.
		16m 14s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/artifact/yetus-precommit-check/output/Dockerfile
GITHUB PR	#72
Optional Tests	dupname scalac scaladoc unit compile
uname	Linux 4cf38c84d016 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 GNU/Linux
Build tool	hb_maven
Personality	dev-support/jenkins/hbase-personality.sh
git revision	master / `fddb433`
Test Results	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/testReport/
Max. process+thread count	947 (vs. ulimit of 12500)
modules	C: spark/hbase-spark U: spark/hbase-spark
Console output	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/console
versions	git=2.20.1
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2021-09-09T21:16:32Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	1m 0s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
-0 ⚠️	test4tests	0m 0s	The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
		_ master Compile Tests _
+1 💚	mvninstall	1m 39s	master passed
+1 💚	compile	0m 37s	master passed
+1 💚	scaladoc	0m 45s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	0m 47s	the patch passed
+1 💚	compile	0m 38s	the patch passed
+1 💚	scalac	0m 38s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	scaladoc	0m 48s	the patch passed
		_ Other Tests _
+1 💚	unit	7m 6s	hbase-spark in the patch passed.
		14m 1s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/artifact/yetus-precommit-check/output/Dockerfile
GITHUB PR	#72
Optional Tests	dupname scalac scaladoc unit compile
uname	Linux b17eab94ab3b 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 GNU/Linux
Build tool	hb_maven
Personality	dev-support/jenkins/hbase-personality.sh
git revision	master / `37aa8d5`
Test Results	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/testReport/
Max. process+thread count	916 (vs. ulimit of 12500)
modules	C: spark/hbase-spark U: spark/hbase-spark
Console output	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/console
versions	git=2.20.1
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Apache-HBase · 2022-01-31T19:31:59Z

🎊 +1 overall

Vote	Subsystem	Runtime	Comment
+0 🆗	reexec	1m 41s	Docker mode activated.
		_ Prechecks _
+1 💚	dupname	0m 0s	No case conflicting files found.
+1 💚	@author	0m 0s	The patch does not contain any @author tags.
-0 ⚠️	test4tests	0m 0s	The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch.
		_ master Compile Tests _
+1 💚	mvninstall	3m 49s	master passed
+1 💚	compile	0m 36s	master passed
+1 💚	scaladoc	0m 45s	master passed
		_ Patch Compile Tests _
+1 💚	mvninstall	0m 44s	the patch passed
+1 💚	compile	0m 39s	the patch passed
+1 💚	scalac	0m 39s	the patch passed
+1 💚	whitespace	0m 0s	The patch has no whitespace issues.
+1 💚	scaladoc	0m 46s	the patch passed
		_ Other Tests _
+1 💚	unit	7m 24s	hbase-spark in the patch passed.
		17m 5s

Subsystem	Report/Notes
Docker	ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/artifact/yetus-precommit-check/output/Dockerfile
GITHUB PR	#72
Optional Tests	dupname scalac scaladoc unit compile
uname	Linux 6e1d66f37f33 5.4.0-1025-aws #25~18.04.1-Ubuntu SMP Fri Sep 11 12:03:04 UTC 2020 x86_64 GNU/Linux
Build tool	hb_maven
Personality	dev-support/jenkins/hbase-personality.sh
git revision	master / `2bfc5f1`
Test Results	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/testReport/
Max. process+thread count	917 (vs. ulimit of 12500)
modules	C: spark/hbase-spark U: spark/hbase-spark
Console output	https://ci-hadoop.apache.org/job/HBase/job/HBase-Connectors-PreCommit/job/PR-72/1/console
versions	git=2.20.1
Powered by	Apache Yetus 0.12.0 https://yetus.apache.org

This message was automatically generated.

Dieken force-pushed the specify-binary-row-key-range branch from 3244622 to 96bfc39 Compare November 3, 2020 03:51

Dieken changed the title ~~allow specifying binary row key range to pre-split regions~~ [HBASE-25357] allow specifying binary row key range to pre-split regions Dec 4, 2020

Dieken force-pushed the specify-binary-row-key-range branch from 96bfc39 to 41f2156 Compare December 4, 2020 04:17

wchevreuil reviewed Dec 4, 2020

Dieken Dec 8, 2020